A Graph-structured Dataset for Wikipedia Research
Wikipedia is a rich and invaluable source of information. Its central place
on the Web makes it a particularly interesting object of study for scientists.
Researchers from different domains have used various complex datasets related to
Wikipedia to study language, social behavior, knowledge organization, and
network theory. Although Wikipedia is a scientific treasure, the large size of the
dataset hinders pre-processing and may be a challenging obstacle for potential
new studies. This issue is particularly acute in scientific domains where
researchers may lack technical and data-processing expertise. On the one hand,
Wikipedia dumps are large, which makes parsing and extracting
relevant information cumbersome. On the other hand, the API is straightforward
to use but restricted to a relatively small number of requests. The middle
ground is the mesoscopic scale, where researchers need a subset of Wikipedia
ranging from thousands to hundreds of thousands of pages, but no
efficient solution exists at this scale.
In this work, we propose an efficient data structure to make requests and
access subnetworks of Wikipedia pages and categories. We provide convenient
tools for accessing and filtering viewership statistics or "pagecounts" of
Wikipedia web pages. The dataset organization leverages principles of graph
databases that allow rapid and intuitive access to subgraphs of Wikipedia
articles and categories. The dataset and deployment guidelines are available on
the LTS2 website: https://lts2.epfl.ch/Datasets/Wikipedia/
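To make the idea of requesting a subnetwork of pages and categories concrete, the following is a minimal sketch in plain Python. It is not the dataset's actual interface or storage layout: the `LINKS` and `CATEGORIES` dictionaries, the `subgraph` function, and its parameters are all hypothetical names and toy data chosen for illustration.

```python
from collections import deque

# Hypothetical toy data: page -> hyperlinked pages, and page -> categories.
LINKS = {
    "Graph theory": ["Leonhard Euler", "Network science"],
    "Network science": ["Graph theory", "Social network"],
    "Leonhard Euler": ["Graph theory"],
    "Social network": ["Network science"],
}
CATEGORIES = {
    "Graph theory": {"Mathematics"},
    "Network science": {"Mathematics", "Sociology"},
    "Leonhard Euler": {"Mathematicians"},
    "Social network": {"Sociology"},
}


def subgraph(seed, max_hops, category=None):
    """Collect pages and links reachable from `seed` within `max_hops` links.

    If `category` is given, only link targets in that category are followed.
    The seed page itself is always kept.
    """
    def keep(page):
        return category is None or category in CATEGORIES.get(page, set())

    seen = {seed}
    edges = []
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for target in LINKS.get(page, []):
            if not keep(target):
                continue
            edges.append((page, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen, edges
```

For example, `subgraph("Graph theory", 2, category="Mathematics")` would return only the pages and links that stay inside the hypothetical "Mathematics" category. A graph database would typically express the same request as a declarative query rather than an explicit traversal.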
Anomaly detection in the dynamics of web and social networks
In this work, we propose a new, fast and scalable method for anomaly
detection in large time-evolving graphs. The graph may be static with dynamic
node attributes (e.g. time series), or it may itself evolve in time, as in a
temporal network. We define an anomaly as a localized increase in temporal
activity in a cluster of nodes. The algorithm is unsupervised. It is able to
detect and track anomalous activity in a dynamic network despite the noise from
multiple interfering sources. We use the Hopfield network model of memory to
combine the graph and time information. We show that anomalies can be spotted
with good precision using a memory network. The presented approach is
scalable and we provide a distributed implementation of the algorithm. To
demonstrate its efficiency, we apply it to two datasets: the Enron Email dataset
and Wikipedia page views. We show that the anomalous spikes are triggered by
real-world events that impact the network dynamics. Moreover, the structure
of the clusters and the analysis of the time evolution associated with the
detected events reveal interesting facts about how humans interact, exchange, and
search for information, opening the door to new quantitative studies of
collective and social behavior on large and dynamic datasets.
Comment: The Web Conference 2019, 10 pages, 7 figures
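The definition of an anomaly as a localized increase in activity within a cluster of nodes can be illustrated with a simplified, Hebbian-style sketch. This is not the authors' algorithm or their distributed implementation. It only assumes that an edge is reinforced whenever both of its endpoints burst at the same time step, and that connected groups of reinforced edges form candidate clusters. The function names, the z-score threshold `z`, and `min_weight` are assumptions made for illustration, and every node that appears in `edges` is assumed to have a series in `activity`.

```python
from statistics import mean, pstdev


def bursts(series, z=1.5):
    """Time steps where activity exceeds the mean by more than z std devs."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return set()  # a flat series never bursts
    return {t for t, x in enumerate(series) if (x - mu) / sigma > z}


def coactivation_weights(edges, activity, z=1.5):
    """Hebbian-style rule: +1 weight per time step where both endpoints burst."""
    burst_times = {node: bursts(s, z) for node, s in activity.items()}
    return {(u, v): len(burst_times[u] & burst_times[v]) for u, v in edges}


def anomalous_clusters(edges, activity, z=1.5, min_weight=1):
    """Connected components of the edges whose weight reaches min_weight."""
    weights = coactivation_weights(edges, activity, z)
    neighbors = {}
    for (u, v), w in weights.items():
        if w >= min_weight:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    clusters, seen = [], set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search over reinforced edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

In this sketch the graph structure restricts which pairs of nodes can reinforce each other, and the time series decide when they do, which is one simple way to combine graph and time information.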
Dynamic pattern recognition in large-scale graphs with applications to social networks
A graph is a versatile data structure facilitating representation of interactions among objects in various complex systems. Very often these objects have attributes whose measurements change over time, reflecting the dynamics of the system. This general data framework can be used in many fields to represent complex data structures: brain networks and neuronal spikes, web networks and clickstreams, social networks and activity of the users, among others. In all of these examples, the structural and dynamic components of the data are inseparable, which significantly complicates the detection, analysis, and interpretation of patterns that emerge in the networks. The increasing size and complexity of graph-structured data require scalable and interpretable algorithms for dynamic pattern detection in such systems.
In this dissertation, we present an unsupervised approach for dynamic pattern detection in large-scale graphs. In this approach, we combine intuitions derived from attention mechanisms, Hopfield networks, and memory networks to build scalable, efficient, and interpretable algorithms. We then demonstrate multiple applications of this approach in recommendation systems, information recovery algorithms, and collective behavior studies. Additionally, we use our algorithm to detect dynamic activity patterns in social and communication networks. We conduct extensive experiments on Wikipedia data, detecting and analyzing patterns in the viewership activity in its web network. To study the collective behavior of Wikipedia readers, we develop an automated pattern interpretation model, which allows for comparison of trending topics across multiple language editions of Wikipedia. The results of the experiments reveal provocative insights into how people interact and search for information in online social networking environments, opening new avenues for future research on collective behavior analysis at a large scale.
Finally, we present a distributed data processing framework for Wikipedia server logs that allows others to reproduce all pattern detection experiments presented in this thesis and to conduct similar collective behavior studies on the latest data.
Spikyball Sampling: Exploring Large Networks via an Inhomogeneous Filtered Diffusion
Studying real-world networks such as social networks or web networks is a challenge. These networks often combine a complex, highly connected structure with a large size. We propose a new approach for large-scale networks that automatically samples user-defined relevant parts of a network. Starting from a few selected places in the network and a reduced set of expansion rules, the method adopts a filtered breadth-first search approach that expands through edges and nodes matching these rules. Moreover, the expansion is performed over a random subset of neighbors at each step to further mitigate the overwhelming number of connections that may exist in large graphs. This gives the image of a “spiky” expansion. We show that this approach generalizes and extends previous exploration sampling methods, such as Snowball or Forest Fire. We demonstrate its ability to capture groups of nodes with high interactions while discarding weakly connected nodes that are often numerous in social networks and may hide important structures.
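The filtered, randomized breadth-first expansion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published Spikyball algorithm: neighbors are sampled uniformly here, whereas the actual method may weight its sampling differently, and the function name `filtered_random_bfs` and the parameters `keep_node`, `max_neighbors`, and `max_hops` are hypothetical.

```python
import random


def filtered_random_bfs(graph, seeds, keep_node, max_neighbors, max_hops, rng=None):
    """Expand from `seeds` hop by hop, following at most `max_neighbors`
    randomly chosen neighbors per node that satisfy `keep_node`."""
    rng = rng or random.Random(0)  # fixed seed (an assumption) for repeatability
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            # Filter step: keep only unvisited neighbors matching the rule.
            candidates = [n for n in graph.get(node, [])
                          if n not in visited and keep_node(n)]
            # Random step: follow only a subset of the remaining neighbors.
            k = min(max_neighbors, len(candidates))
            for n in rng.sample(candidates, k):
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        if not next_frontier:
            break
        frontier = next_frontier
    return visited
```

With `max_neighbors` set larger than any node degree and a filter that accepts every node, this sketch reduces to a plain Snowball-style breadth-first expansion, which matches the sense in which the random, filtered variant generalizes it.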
A Lie-Group Adaptive Method to Identify the Radiative Coefficients in Parabolic Partial Differential Equations
What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions
In this work, we propose an automatic evaluation and comparison of the browsing behavior of Wikipedia readers that can be applied to any language edition of Wikipedia. As an example, we focus on the English, French, and Russian editions during the last four months of 2018. The proposed method has three steps. First, it extracts the most trending articles over a chosen period of time. Second, it performs a semi-supervised topic extraction. Third, it compares topics across languages. The automated processing works with data that combines Wikipedia's graph of hyperlinks, pageview statistics, and summaries of the pages. The results show that people share a common interest in and curiosity about entertainment (e.g. movies, music, and sports) independently of their language. Differences appear in topics related to local events or cultural particularities. Interactive visualizations showing clusters of trending pages in each language edition are available online at https://wiki-insights.epfl.ch/wikitrend